Introduction

Inspiration and Motive

With the recent emergence of new online betting apps and websites, interest in the sports betting world has grown considerably. I have used some of these sports betting outlets myself, legally of course, and became interested in combining statistics with betting odds to potentially predict the winner of an NFL game.

Therefore, the main goal of this project is to build a binary classification model that predicts whether the home or visiting team will win an NFL game.

Data Description

Because of my personal interest and involvement in sports betting, I combined historical data and statistics from multiple websites to create a data set with enough predictors and observations for a machine learning project of this style.

This dataset, titled “nfl_szn_2018_to_2019.csv”, is a csv file that includes all of the NFL regular season and playoff games played in the 2018 - 2019 NFL season.

The historical betting odds, final scores, and date of each game came from: https://www.sportsbookreviewsonline.com/scoresoddsarchives/nfl-odds-2018-19/.

For the remaining historical data on previous records and winning percentages I used: https://www.pro-football-reference.com/years/2018/, adding the columns to the data set with mutate() in R.

For running totals, I used R to group the data and sum cumulatively, again with mutate().
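
As a small illustration of how such a running total can be built (the column names here are hypothetical, not the actual ones in the data set), group_by() with cumsum() and lag() inside mutate() produces a "wins before this game" count per team:

```r
library(dplyr)

# toy results table: one row per team per week (hypothetical columns)
results <- tibble(
  team = c("A", "A", "A", "B", "B", "B"),
  week = c(1, 2, 3, 1, 2, 3),
  won  = c(1, 0, 1, 0, 1, 1)
)

# lag() shifts the cumulative sum so each row counts wins *before* that game
results <- results %>%
  group_by(team) %>%
  arrange(week, .by_group = TRUE) %>%
  mutate(prev_wins = lag(cumsum(won), default = 0)) %>%
  ungroup()
```

The lag() with default = 0 matters: without it, each row would include the outcome of the game itself, leaking the result into the predictor.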

For injuries, I went week by week and manually entered each team’s number of injured players from: https://www.nfl.com/injuries/league/2018/REG1.

Project Outline

Now that we have an initial grasp of the data we will be working with, let’s create a roadmap for reaching our goal of predicting NFL games. This will be a multi-step process.

This process begins with cleaning and organizing our data by making necessary adjustments. Specifically, we will load the data set, check for missing data, and tidy the data set.

Then, we will examine how the data set and its predictor variables are constructed and related by way of its structure, summary, definition of variables, and visual EDA emphasizing plots, graphs, and correlation matrices.

Next, we will split the data into training and testing sets, make a recipe, and set folds for cross-validation.
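As a hedged sketch of what this step might look like with tidymodels (the split proportion and recipe steps shown here are illustrative placeholders, not the project's final choices), assuming nfl_bet has already been tidied and loaded:

```r
library(tidymodels)

set.seed(1234)  # reproducible split

# stratified split so both sets keep the home/visitor win ratio
nfl_split <- initial_split(nfl_bet, prop = 0.75, strata = winner)
nfl_train <- training(nfl_split)
nfl_test  <- testing(nfl_split)

# illustrative recipe: drop non-predictors, dummy-encode factors, normalize numerics
nfl_recipe <- recipe(winner ~ ., data = nfl_train) %>%
  step_rm(final_v, final_h, team_v, team_h) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_normalize(all_numeric_predictors())

# 10-fold cross-validation, stratified on the response
nfl_folds <- vfold_cv(nfl_train, v = 10, strata = winner)
```

Stratifying both the split and the folds on winner keeps the roughly 60/40 home/visitor imbalance consistent across training, testing, and every fold.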

Logistic Regression, K-Nearest Neighbors, Linear Discriminant Analysis, Quadratic Discriminant Analysis, Elastic Net - Lasso, Decision Trees, Random Forest, Boosted Trees, and Support Vector Machine models will all be built and trained on the training set we split off. We will tune each model to its best configuration, analyze the results for all models, rank them by the performance metrics of accuracy and roc_auc, and produce plots for each model. Lastly, we will fit the model with the best roc_auc value and the model with the best accuracy value to the testing data set. Hopefully one or both models will give us positive results and accurately predict the NFL games in the testing data set.
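For one of these models, the evaluation loop might look like the following (a minimal sketch, assuming a recipe object nfl_recipe and cross-validation folds nfl_folds have already been defined; logistic regression is shown because it needs no tuning):

```r
library(tidymodels)

# plain logistic regression: no hyperparameters to tune
log_spec <- logistic_reg() %>%
  set_engine("glm") %>%
  set_mode("classification")

# bundle preprocessing and model into one workflow
log_wflow <- workflow() %>%
  add_recipe(nfl_recipe) %>%
  add_model(log_spec)

# evaluate across the CV folds on our two chosen metrics
log_res <- fit_resamples(
  log_wflow,
  resamples = nfl_folds,
  metrics   = metric_set(accuracy, roc_auc)
)

collect_metrics(log_res)  # mean accuracy and roc_auc across folds
```

Tunable models (e.g. elastic net or random forest) would swap fit_resamples() for tune_grid() over a parameter grid, but the workflow structure stays the same.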

Exploring Our Data

Let’s begin this machine learning project by introducing our data, making any needed changes, and examining the different aspects of how our data is compiled in this specific data set. It is crucial that we understand each aspect of our data set in order to ensure we can interpret every part of our project.

Necessary Data Adjustments

Before we can build anything, we must first load our data and make the necessary adjustments and checks. This verifies that the data is not only applicable to the project, but also organized and labelled in a way that makes it easy to manipulate and analyze in the more detailed and thorough steps to come.

Loading the Data

To start, we will load in our data set without the counter column.

# to read in our csv data 
# also drops unneeded counter column
nfl_bet <- suppressMessages(read_csv("nfl_szn_2018_to_2019.csv", 
    col_types = cols(...1 = col_skip())))

Checking Data for Missing Values

Although I created this data set, it is always necessary to check for missing data. Missing data could create problems further down the line of our analysis, model building, and predicting.

# plot of missing values in the data
nfl_bet %>%
  vis_dat()

From the visual representation of our data, there does not appear to be any missing data. Just to be safe, we will check all observations as well.

sum(is.na(nfl_bet))
## [1] 0

Perfect, we have zero observations with missing data. Let’s move on to tidying and manipulating our data set.
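
Had any NAs turned up, a per-column count would pinpoint which variables are affected; a tiny self-contained illustration:

```r
# colSums(is.na(.)) counts missing values per column
toy <- data.frame(a = c(1, NA, 3), b = c("x", "y", NA))
colSums(is.na(toy))  # one NA in each column
```

In our case the same call on nfl_bet would return all zeros, matching the sum above.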

Tidying the Data

Once again, even though I created this data set there are still variables that need to be removed and certain tidying of the data that needs to take place before we can progress with our project.

First, because we are only concerned with NFL games that finished with a winner, we will remove any NFL games that ended in a tie.

# creates a subset of the data that excludes any games that ended in a tie
nfl_bet <- subset(nfl_bet,Winner != "Tie")

Second, we will adjust the divisions so that the division columns can be converted to a factor type.

# converts each division into its region of the United States
# by stripping the AFC/NFC conference prefix
nfl_bet <- nfl_bet %>%
  mutate(Division_V = sub("^(AFC|NFC)", "", Division_V),
         Division_H = sub("^(AFC|NFC)", "", Division_H))

Third, we will remove the Date and GameID columns because they are not needed as predictors, nor do they help us understand or analyze the data set for our purposes.

# removes the Date and GameID columns
nfl_bet <- nfl_bet %>%
  select(-Date, -GameID)

Fourth, we will clean the variable names by using clean_names() so that all of our data columns are uniform and easier to work with.

# cleans up the variable names
nfl_bet <- nfl_bet %>%
  as_tibble() %>%
  clean_names()

Fifth, we will convert all of our character-type variables, except the NFL team names, into factor type so that they can be used as predictors.

# converts the character variables into factors so they can serve as predictors
factor_cols <- c("winner", "same_div", "same_conf",
                 "off_bye_wk_v", "off_bye_wk_h",
                 "division_v", "division_h",
                 "conference_v", "conference_h")
nfl_bet <- nfl_bet %>%
  mutate(across(all_of(factor_cols), as.factor))

Now that our data has been adjusted correctly for the project, we will save it as an .rda file using the R command save() so that it can be loaded into the different R files used for the model-building portion of this project.

save(nfl_bet,file="nfl_bet.rda")
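
In those model-building files, load() then restores the object under its original name. A minimal round-trip sketch (using a stand-in object and a temporary file for illustration):

```r
# save() serializes the object; load() restores it under the same name
obj  <- data.frame(x = 1:3)              # stand-in object for illustration
path <- tempfile(fileext = ".rda")
save(obj, file = path)
rm(obj)                                  # gone from the environment...
load(path)                               # ...and back again as `obj`
```

Note that load() brings the object back under its saved name, so there is no assignment on the load(path) line.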

Summary and Structure of Data

Since our data has been adjusted correctly, we will examine the attributes, structure, and summary of our data.

Below is a summary of our data:

# looking at summary of data set
nfl_bet %>%
  summary()
##       week        same_div same_conf    team_v             team_h         
##  Min.   : 1.000   N:169    N: 64     Length:264         Length:264        
##  1st Qu.: 5.000   Y: 95    Y:200     Class :character   Class :character  
##  Median : 9.500                      Mode  :character   Mode  :character  
##  Mean   : 9.455                                                           
##  3rd Qu.:14.000                                                           
##  Max.   :20.000                                                           
##     final_v         final_h           ml_v               ml_h        
##  Min.   : 0.00   Min.   : 0.00   Min.   :-1200.00   Min.   :-2000.0  
##  1st Qu.:16.00   1st Qu.:17.00   1st Qu.: -142.75   1st Qu.: -300.0  
##  Median :22.00   Median :24.00   Median :  145.00   Median : -170.0  
##  Mean   :22.22   Mean   :24.42   Mean   :   80.31   Mean   : -161.4  
##  3rd Qu.:28.00   3rd Qu.:31.00   3rd Qu.:  240.00   3rd Qu.:  122.8  
##  Max.   :51.00   Max.   :54.00   Max.   : 1000.00   Max.   :  750.0  
##    injuries_v       injuries_h     division_v division_h conference_v
##  Min.   : 0.000   Min.   : 2.000   East :68   East :66   AFC:132     
##  1st Qu.: 7.000   1st Qu.: 7.000   North:62   North:64   NFC:132     
##  Median : 9.000   Median : 9.000   South:66   South:67               
##  Mean   : 9.568   Mean   : 9.818   West :68   West :67               
##  3rd Qu.:12.000   3rd Qu.:13.000                                     
##  Max.   :21.000   Max.   :22.000                                     
##  conference_h szn_win_per_v    szn_win_per_h       ud_tot_v     
##  AFC:132      Min.   :0.0000   Min.   :0.0000   Min.   : 0.000  
##  NFC:132      1st Qu.:0.3570   1st Qu.:0.3330   1st Qu.: 1.000  
##               Median :0.5000   Median :0.5000   Median : 3.000  
##               Mean   :0.5047   Mean   :0.4947   Mean   : 3.686  
##               3rd Qu.:0.6670   3rd Qu.:0.6670   3rd Qu.: 6.000  
##               Max.   :1.0000   Max.   :1.0000   Max.   :13.000  
##     ud_tot_h         fv_tot_v         fv_tot_h       ud_win_per_v  
##  Min.   : 0.000   Min.   : 0.000   Min.   : 0.000   Min.   :0.000  
##  1st Qu.: 1.000   1st Qu.: 1.000   1st Qu.: 1.000   1st Qu.:0.200  
##  Median : 3.000   Median : 3.000   Median : 3.000   Median :0.400  
##  Mean   : 3.761   Mean   : 4.027   Mean   : 3.989   Mean   :0.414  
##  3rd Qu.: 6.000   3rd Qu.: 6.000   3rd Qu.: 6.000   3rd Qu.:0.500  
##  Max.   :13.000   Max.   :17.000   Max.   :16.000   Max.   :1.000  
##   ud_win_per_h    fv_win_per_v     fv_win_per_h    prev_win_rec_v  
##  Min.   :0.000   Min.   :0.0000   Min.   :0.0000   Min.   : 0.000  
##  1st Qu.:0.200   1st Qu.:0.5000   1st Qu.:0.5000   1st Qu.: 1.750  
##  Median :0.500   Median :0.5584   Median :0.5000   Median : 3.000  
##  Mean   :0.419   Mean   :0.6081   Mean   :0.5914   Mean   : 4.027  
##  3rd Qu.:0.500   3rd Qu.:0.8000   3rd Qu.:0.8000   3rd Qu.: 6.000  
##  Max.   :1.000   Max.   :1.0000   Max.   :1.0000   Max.   :14.000  
##  prev_win_rec_h   prev_loss_rec_v  prev_loss_rec_h   hc_win_per_v   
##  Min.   : 0.000   Min.   : 0.000   Min.   : 0.000   Min.   :0.1132  
##  1st Qu.: 1.000   1st Qu.: 1.000   1st Qu.: 2.000   1st Qu.:0.4243  
##  Median : 3.000   Median : 3.000   Median : 3.000   Median :0.5184  
##  Mean   : 4.011   Mean   : 3.795   Mean   : 3.837   Mean   :0.5175  
##  3rd Qu.: 6.000   3rd Qu.: 6.000   3rd Qu.: 6.000   3rd Qu.:0.5860  
##  Max.   :14.000   Max.   :12.000   Max.   :11.000   Max.   :3.0377  
##   hc_win_per_h        hc_g_v           hc_g_h       prev_szn_win_per_v
##  Min.   :0.1048   Min.   : 13.00   Min.   : 12.00   Min.   :0.0000    
##  1st Qu.:0.4246   1st Qu.: 66.75   1st Qu.: 68.75   1st Qu.:0.3595    
##  Median :0.5208   Median : 92.50   Median : 94.50   Median :0.5630    
##  Mean   :0.5199   Mean   :135.18   Mean   :141.17   Mean   :0.5013    
##  3rd Qu.:0.5887   3rd Qu.:220.50   3rd Qu.:233.25   3rd Qu.:0.6250    
##  Max.   :3.0769   Max.   :455.00   Max.   :459.00   Max.   :0.8130    
##  prev_szn_win_per_h off_bye_wk_v off_bye_wk_h winner 
##  Min.   :0.0000     N:242        N:254        H:158  
##  1st Qu.:0.3750     Y: 22        Y: 10        V:106  
##  Median :0.5630                                      
##  Mean   :0.5056                                      
##  3rd Qu.:0.6250                                      
##  Max.   :0.8130

Below is the structure of our data:

# looking at the structure of the data set
nfl_bet %>%
  str()
## tibble [264 × 38] (S3: tbl_df/tbl/data.frame)
##  $ week              : num [1:264] 1 1 1 1 1 1 1 1 1 1 ...
##  $ same_div          : Factor w/ 2 levels "N","Y": 1 1 1 1 1 2 1 1 2 1 ...
##  $ same_conf         : Factor w/ 2 levels "N","Y": 2 2 2 2 1 2 2 2 2 1 ...
##  $ team_v            : chr [1:264] "Atlanta" "SanFrancisco" "Cincinnati" "Buffalo" ...
##  $ team_h            : chr [1:264] "Philadelphia" "Minnesota" "Indianapolis" "Baltimore" ...
##  $ final_v           : num [1:264] 12 16 34 3 20 48 20 20 38 24 ...
##  $ final_h           : num [1:264] 18 24 23 47 15 40 27 27 28 27 ...
##  $ ml_v              : num [1:264] -119 230 -115 290 -145 400 225 -110 160 160 ...
##  $ ml_h              : num [1:264] -101 -280 -105 -380 125 -550 -275 -110 -190 -190 ...
##  $ injuries_v        : num [1:264] 8 5 7 3 5 7 8 5 5 11 ...
##  $ injuries_h        : num [1:264] 6 8 7 4 3 2 6 7 3 8 ...
##  $ division_v        : Factor w/ 4 levels "East","North",..: 3 4 2 1 3 3 3 3 4 4 ...
##  $ division_h        : Factor w/ 4 levels "East","North",..: 1 2 3 2 1 3 1 1 4 4 ...
##  $ conference_v      : Factor w/ 2 levels "AFC","NFC": 2 2 1 1 1 2 1 1 1 2 ...
##  $ conference_h      : Factor w/ 2 levels "AFC","NFC": 2 2 1 1 2 2 1 1 1 1 ...
##  $ szn_win_per_v     : num [1:264] 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...
##  $ szn_win_per_h     : num [1:264] 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...
##  $ ud_tot_v          : num [1:264] 0 0 0 0 0 0 0 0 0 0 ...
##  $ ud_tot_h          : num [1:264] 0 0 0 0 0 0 0 0 0 0 ...
##  $ fv_tot_v          : num [1:264] 0 0 0 0 0 0 0 0 0 0 ...
##  $ fv_tot_h          : num [1:264] 0 0 0 0 0 0 0 0 0 0 ...
##  $ ud_win_per_v      : num [1:264] 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...
##  $ ud_win_per_h      : num [1:264] 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...
##  $ fv_win_per_v      : num [1:264] 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...
##  $ fv_win_per_h      : num [1:264] 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...
##  $ prev_win_rec_v    : num [1:264] 0 0 0 0 0 0 0 0 0 0 ...
##  $ prev_win_rec_h    : num [1:264] 0 0 0 0 0 0 0 0 0 0 ...
##  $ prev_loss_rec_v   : num [1:264] 0 0 0 0 0 0 0 0 0 0 ...
##  $ prev_loss_rec_h   : num [1:264] 0 0 0 0 0 0 0 0 0 0 ...
##  $ hc_win_per_v      : num [1:264] 0.522 0.589 0.521 0.685 0.402 0.438 0.488 0.575 0.638 0.589 ...
##  $ hc_win_per_h      : num [1:264] 0.539 0.566 0.477 0.611 0.286 0.654 0.609 0.391 0.429 0.313 ...
##  $ hc_g_v            : num [1:264] 69 265 240 89 82 32 84 73 378 90 ...
##  $ hc_g_h            : num [1:264] 89 113 65 234 49 442 233 64 49 16 ...
##  $ prev_szn_win_per_v: num [1:264] 0.625 0.375 0.438 0.563 0.625 0.313 0.25 0.563 0.625 0.563 ...
##  $ prev_szn_win_per_h: num [1:264] 0.813 0.813 0.25 0.563 0.188 0.688 0.813 0.375 0.563 0.313 ...
##  $ off_bye_wk_v      : Factor w/ 2 levels "N","Y": 1 1 1 1 1 1 1 1 1 1 ...
##  $ off_bye_wk_h      : Factor w/ 2 levels "N","Y": 1 1 1 1 1 1 1 1 1 1 ...
##  $ winner            : Factor w/ 2 levels "H","V": 1 1 2 1 2 2 1 1 2 1 ...

Below are the dimensions of our data:

# dimensions of data
nfl_bet %>% dim()
## [1] 264  38

Thus, we can observe that our data set contains 264 observations with 38 columns.

Data Sample

Finally, we will take a look at the first 10 rows of our data to visualize what our data looks like from a table perspective.

head(nfl_bet)
## # A tibble: 6 × 38
##    week same_div same_conf team_v  team_h final_v final_h  ml_v  ml_h injuries_v
##   <dbl> <fct>    <fct>     <chr>   <chr>    <dbl>   <dbl> <dbl> <dbl>      <dbl>
## 1     1 N        Y         Atlanta Phila…      12      18  -119  -101          8
## 2     1 N        Y         SanFra… Minne…      16      24   230  -280          5
## 3     1 N        Y         Cincin… India…      34      23  -115  -105          7
## 4     1 N        Y         Buffalo Balti…       3      47   290  -380          3
## 5     1 N        N         Jackso… NYGia…      20      15  -145   125          5
## 6     1 Y        Y         TampaB… NewOr…      48      40   400  -550          7
## # ℹ 28 more variables: injuries_h <dbl>, division_v <fct>, division_h <fct>,
## #   conference_v <fct>, conference_h <fct>, szn_win_per_v <dbl>,
## #   szn_win_per_h <dbl>, ud_tot_v <dbl>, ud_tot_h <dbl>, fv_tot_v <dbl>,
## #   fv_tot_h <dbl>, ud_win_per_v <dbl>, ud_win_per_h <dbl>, fv_win_per_v <dbl>,
## #   fv_win_per_h <dbl>, prev_win_rec_v <dbl>, prev_win_rec_h <dbl>,
## #   prev_loss_rec_v <dbl>, prev_loss_rec_h <dbl>, hc_win_per_v <dbl>,
## #   hc_win_per_h <dbl>, hc_g_v <dbl>, hc_g_h <dbl>, prev_szn_win_per_v <dbl>, …

Variables of Our Data

Let’s define and explain each variable in our data set. Note that we have 1 response variable, 4 non-predictor variables, and 33 predictor variables.

Response Variable

Definition of our response variable:

- winner: The winner of each NFL game (either H for the home team or V for the visiting team)

nfl_bet$winner %>%
  summary()
##   H   V 
## 158 106

We can see that the home team won 52 more games in the season than the visiting team. This will be further analyzed in the visual EDA of the response variable distribution.

Non-Predictor Variables

Looking at our dimensions, we must note that not all columns will be used as predictors in our recipe. The variables that will not be used as predictors, but are necessary for understanding the data as a whole, are:

- final_v: The final score of the visiting team for each game. This will not be used as a predictor because it would directly tell our model the outcome of each game.
- final_h: The final score of the home team for each game. This will not be used as a predictor because it would directly tell our model the outcome of each game.
- team_v: The visiting team of each NFL game. This will not be used as a predictor because we want to reduce the cardinality of our data set.
- team_h: The home team of each NFL game. This will not be used as a predictor because we want to reduce the cardinality of our data set.

Predictor Variables

After denoting our response variable, setting aside the non-predictor variables, and removing unnecessary columns, we are left with 33 predictor variables:

- week: NFL Week Number - The week number in which the NFL game is played. There are 17 NFL weeks in the regular season plus 3 weeks of playoffs, resulting in 20 total weeks.
- same_div: Same NFL Division - States whether the two teams playing each game are in the same NFL division (there are 8 divisions in total): N if they are not and Y if they are.
- same_conf: Same NFL Conference - States whether the two teams playing each game are in the same NFL conference (there are 2 conferences in total): N if they are not and Y if they are.
- ml_v: Visitor Moneyline - The visiting NFL team’s betting odds to win each game. A moneyline below 0 means the team has the better odds of winning (the favorite); a moneyline above 0 means the team has the worse odds of winning (the underdog).
- ml_h: Home Moneyline - The home NFL team’s betting odds to win each game, with the same convention: below 0 marks the favorite, above 0 the underdog.
- injuries_v: Visitor Injuries - The total number of injured players reported before each game for the visiting NFL team.
- injuries_h: Home Injuries - The total number of injured players reported before each game for the home NFL team.
- division_v: Visitor Division - The division the visiting NFL team belongs to. There are 8 unique divisions: NFC North, South, East, & West, and AFC North, South, East, & West. For convenience, these divisions are referred to simply as North, South, East, and West.
- division_h: Home Division - The division the home NFL team belongs to, with the same naming convention as above.
- conference_v: Visitor Conference - The conference the visiting NFL team belongs to. There are 2 unique conferences: AFC & NFC.
- conference_h: Home Conference - The conference the home NFL team belongs to: AFC or NFC.
- szn_win_per_v: Visitor Season Winning Percentage - A running percentage giving the visiting NFL team’s winning percentage over the season leading up to each game: (Total Games Won)/(Total Games Played).
- szn_win_per_h: Home Season Winning Percentage - The same running winning percentage for the home NFL team.
- ud_tot_v: Visitor Underdog Total - A running total, leading up to each game, of how many previous games this season the visiting NFL team entered as the underdog.
- ud_tot_h: Home Underdog Total - The same running underdog count for the home NFL team.
- fv_tot_v: Visitor Favorite Total - A running total, leading up to each game, of how many previous games this season the visiting NFL team entered as the favorite.
- fv_tot_h: Home Favorite Total - The same running favorite count for the home NFL team.
- ud_win_per_v: Visitor Underdog Winning Percentage - A running percentage, leading up to each game, of how often the visiting NFL team won when playing as the underdog: (Games Won as Underdog)/(Games Played as Underdog).
- ud_win_per_h: Home Underdog Winning Percentage - The same running underdog winning percentage for the home NFL team.
- fv_win_per_v: Visitor Favorite Winning Percentage - A running percentage, leading up to each game, of how often the visiting NFL team won when playing as the favorite: (Games Won as Favorite)/(Games Played as Favorite).
- fv_win_per_h: Home Favorite Winning Percentage - The same running favorite winning percentage for the home NFL team.
- prev_win_rec_v: Visitor Previous Win Record Before Game - The running total of games the visiting NFL team won in all previous games of the season.
- prev_win_rec_h: Home Previous Win Record Before Game - The running total of games the home NFL team won in all previous games of the season.
- prev_loss_rec_v: Visitor Previous Loss Record Before Game - The running total of games the visiting NFL team lost in all previous games of the season.
- prev_loss_rec_h: Home Previous Loss Record Before Game - The running total of games the home NFL team lost in all previous games of the season.
- hc_win_per_v: Visitor Head Coach Winning Percentage - The running career winning percentage of the visiting NFL team’s head coach leading up to each game: (Career Games Won as an NFL Head Coach)/(Career Games Coached as an NFL Head Coach).
- hc_win_per_h: Home Head Coach Winning Percentage - The same running career winning percentage for the home NFL team’s head coach.
- hc_g_v: Visitor Head Coach Games Coached - The running career total of games coached as an NFL head coach by the visiting NFL team’s head coach, leading up to each game.
- hc_g_h: Home Head Coach Games Coached - The same running career total for the home NFL team’s head coach.
- prev_szn_win_per_v: Visitor Previous Season Win Percentage - The visiting NFL team’s overall winning percentage in the previous (2017-2018) season.
- prev_szn_win_per_h: Home Previous Season Win Percentage - The home NFL team’s overall winning percentage in the previous (2017-2018) season.
- off_bye_wk_v: Visitor Playing Off a Bye Week - Whether the visiting NFL team had a bye week (i.e., did not play) the week before the game: N if the team played the previous week and Y if it did not.
- off_bye_wk_h: Home Playing Off a Bye Week - Whether the home NFL team had a bye week the week before the game: N if the team played the previous week and Y if it did not.
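
To make the moneyline convention concrete, here is a small helper (hypothetical, not part of the project's data pipeline) that converts American moneyline odds to an implied win probability:

```r
# converts American moneyline odds to the implied win probability
# (the bookmaker's margin is ignored, so the two teams' implied
#  probabilities for one game will sum to slightly more than 1)
ml_to_prob <- function(ml) {
  ifelse(ml < 0,
         -ml / (-ml + 100),   # favorite: e.g. -150 -> 150/250 = 0.60
         100 / (ml + 100))    # underdog: e.g. +200 -> 100/300 = 0.33
}

ml_to_prob(-150)  # 0.6
ml_to_prob(200)   # 0.3333...
```

This is why a more negative moneyline corresponds to a stronger favorite: -1200 implies a win probability above 0.9, while +1000 implies one below 0.1.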

Visual EDA

Game Winner Distribution

win_bar <- ggplot(nfl_bet, aes(x=winner)) +
  geom_bar(color="black",
           fill = "cyan") +
  labs(x="Game Winner",y="# of Games Won",title="Distribution of the Number of Games Won by Home and Visiting Teams")
win_bar

As we can see, this confirms the summary of our response variable: 52 more games were won by the home team. About 59.8% of games were won by the home team, while only about 40.2% were won by the visiting team. This suggests a significant ‘home field advantage’ in this NFL season, which could be a crucial factor to consider in our predictive modeling and analysis.
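
The percentages quoted above follow directly from the class counts in the summary (H = 158, V = 106):

```r
# class balance of the response: share of games won by home vs visiting teams
wins <- c(H = 158, V = 106)             # counts from summary(nfl_bet$winner)
round(100 * wins / sum(wins), 1)
#>    H    V
#> 59.8 40.2
```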

Variable Correlation Plot

To understand the relationships among our numerical variables, we will create a correlation matrix. Following that, we’ll visualize these correlations using a heatmap, providing a clear graphical representation of how these predictors are related to each other.

nfl_bet %>%
  select(-final_v,-final_h) %>%
  select(where(is.numeric)) %>% # selecting numeric columns
  cor() %>%
  corrplot(method="color",type = "lower", diag = FALSE)

That heatmap may be a little too difficult to analyze, so let’s display the correlations as numbers instead to get a better numerical understanding of the relationships in our data set.

nfl_bet %>%
  select(-final_v,-final_h) %>%
  select(where(is.numeric)) %>% # selecting numeric columns
  cor() %>%
  corrplot(method="number",type = "lower", diag = FALSE)

Now we have a better plot to look at. Our strongest correlation is a negative correlation of -0.95 between the ml_h and ml_v predictors. This makes sense: there are very few instances where both teams have the same odds to win a game against each other, so if either the home or visiting team has the better odds of winning, the opposing team necessarily has the worse odds, producing a very strong negative correlation.

In terms of our next strongest correlation, we note that prev_win_rec_h with fv_tot_h, and prev_win_rec_v with fv_tot_v, are tied for second with a positive correlation of 0.87. This also makes sense: a team with an above-average number of wins for the season will also tend to have the most games in which it was the favorite, since in the majority of instances the team with more wins is favored.

To me, the most interesting part of this correlation plot is that the number of injuries, for either the home or visiting team, does not have a strong correlation with any other variable. I had always thought that more injuries going into a game would result in either worse odds to win or a worse overall record.
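
Rather than reading the strongest pairs off the plot, they can also be extracted programmatically; a hedged sketch (assuming nfl_bet is loaded, using the same numeric subset as above):

```r
library(dplyr)
library(tidyr)

# same numeric subset used for the corrplot above
cor_mat <- nfl_bet %>%
  select(-final_v, -final_h) %>%
  select(where(is.numeric)) %>%
  cor()

# blank out the upper triangle and diagonal so each pair appears once
cor_mat[upper.tri(cor_mat, diag = TRUE)] <- NA

# melt the matrix into (var1, var2, corr) rows and rank by |corr|
as.data.frame(as.table(cor_mat)) %>%
  drop_na() %>%
  rename(var1 = Var1, var2 = Var2, corr = Freq) %>%
  arrange(desc(abs(corr))) %>%
  head(5)
```

On this data, the top rows should be the ml_h/ml_v pair at about -0.95, followed by the two prev_win_rec/fv_tot pairs at about 0.87.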

Moneyline Dependent Bar Plots

Now we will create a few visualizations using the moneyline predictor to see how these odds influence the winner of each game.

First, we have a barplot of games won when the home team is the favorite, i.e. favored to win.

HV_ml_h_bar <- ggplot(nfl_bet, aes(x=winner,fill=ml_h<0)) +
  geom_bar() +
  ggtitle("Games Won When Home Team is Favored to Win") +
  labs(fill = "Favorite = Home Team",x= "Game Winner",y= "# of Games Won")
HV_ml_h_bar

Next, we have a barplot of games won when the visiting team is the favorite.

HV_ml_v_bar <- ggplot(nfl_bet, aes(x=winner,fill=ml_v<0)) +
  geom_bar() +
  ggtitle("Games Won When Visiting Team is Favored to Win") +
  labs(fill = "Favorite = Visiting Team",x= "Game Winner",y= "# of Games Won")
HV_ml_v_bar

Moneyline Dependent Box Plot (with Jitter)

box <- ggplot(nfl_bet,aes(x=ml_h,y=winner, group=winner)) +
  geom_boxplot(aes(fill = winner))+
  theme_minimal()+
  geom_jitter(alpha=0.5) +
  theme(legend.position = "top") +
  labs(x="Moneyline Odds")
box

Week Dependent Bar Plot

Here, we have a barplot of game winners in every week of the NFL season.

week_bar <- ggplot(nfl_bet, aes(x=factor(week,levels=1:20),fill=winner)) +
  geom_bar(position=position_dodge()) +
  ggtitle("Game Winners Per Week") +
  labs(x= "Week #",y= "# of Games Won")

week_bar

Previous Team Wins Before Game

Below is a barplot of game winners when the previous number of games won is an influence.

nfl_prev_win <- nfl_bet %>%
  pivot_longer(cols = c(prev_win_rec_h, prev_win_rec_v), names_to = "prev_win_type", values_to = "prev_win")

prev_win_bar <- ggplot(nfl_prev_win, aes(x=factor(prev_win,levels=0:14),fill=winner)) +
  geom_bar(position=position_dodge()) +
  ggtitle("Game Winners vs # of Previous Team Wins Before Game") +
  labs(x= "# of Previous Team Wins Before Game",y= "# of Games Won")

prev_win_bar

Division Dependent Bar Plot

Now, with this barplot we can see the distribution of wins in each specific NFL division.

nfl_div <- nfl_bet %>%
  pivot_longer(cols = c(division_h, division_v), names_to = "division_type", values_to = "division")

div_bar <- ggplot(nfl_div, aes(x=division,fill=winner)) +
  geom_bar(position=position_dodge()) +
  ggtitle("Game Winners vs Division") +
  labs(x= "Division",y= "# of Games Won")

div_bar

Conference Dependent Bar Plot

Similar to the previous barplot, this is a distribution of games won in each specific NFL conference.

nfl_cnf <- nfl_bet %>%
  pivot_longer(cols = c(conference_h, conference_v), names_to = "conference_type", values_to = "conf")

conf_bar <- ggplot(nfl_cnf, aes(x=conf,fill=winner)) +
  geom_bar(position=position_dodge()) +
  ggtitle("Game Winners vs Conference") +
  labs(x= "Conference",y= "# of Games Won")

conf_bar

Injury Dependent Bar Plot

Next is a barplot of how many games teams won dependent on how many injuries they reported before the game.

nfl_inj <- nfl_bet %>%
  pivot_longer(cols = c(injuries_h, injuries_v), names_to = "inj_type", values_to = "inj")

inj_bar <- ggplot(nfl_inj, aes(x=factor(inj,levels=0:30),fill=winner)) +
  geom_bar(position=position_dodge()) +
  ggtitle("Game Winners vs # of Injuries") +
  labs(x= "# of Injuries",y= "# of Games Won")

inj_bar

Total Games as Underdog Dependent Bar Plot

Here, we have the number of games a team won dependent on the number of games it had already played as the underdog.

nfl_ud <- nfl_bet %>%
  pivot_longer(cols = c(ud_tot_h, ud_tot_v), names_to = "ud_type", values_to = "ud")

ud_bar <- ggplot(nfl_ud, aes(x=factor(ud,levels=0:30),fill=winner)) +
  geom_bar(position=position_dodge()) +
  ggtitle("Game Winners vs # of Games as Underdog") +
  labs(x= "# of Games as Underdog",y= "# of Games Won")

ud_bar

Total Games as Favorite Dependent Bar Plot

Finally, we have the number of games a team won dependent on the number of games it had already played as the favorite.

nfl_fv <- nfl_bet %>%
  pivot_longer(cols = c(fv_tot_h, fv_tot_v), names_to = "fv_type", values_to = "fv")

fv_bar <- ggplot(nfl_fv, aes(x=factor(fv,levels=0:30),fill=winner)) +
  geom_bar(position=position_dodge()) +
  ggtitle("Game Winners vs # of Games as Favorite") +
  labs(x= "# of Games as Favorite",y= "# of Games Won")

fv_bar

Setting Up Models

Having gained insights into some key variables that influence whether the home or visiting team wins an NFL game, we can now progress to the model-building portion of our project.

Train/Test Split

Prior to building our models, we must split our data into training and testing subsets of the original data set in order to accurately choose the model that fits and predicts our data best. Note that before we split, we must set our seed so that the random split can be reproduced every time we train and test our models.

# setting seed for ability to reproduce
set.seed(2101)
# splitting the data at 70/30
nfl_split <- initial_split(nfl_bet, prop = 0.70, strata = winner)
# training subset (70)
nfl_train <- training(nfl_split)
# testing subset (30)
nfl_test <- testing(nfl_split)

Below are the dimensions of our NFL training data set: nfl_train

dim(nfl_train)
## [1] 184  38

With this training data set we will train a multitude of different classification models, based on the values of the predictors and response variable in combination with the recipe we will create in the next section. With the testing data set created in the split, we will test only the models that performed best on the training data. The main purpose of this split is to verify that the model we deem the best fit can perform accurately on data it has not seen or been trained on.

Below are the dimensions of our NFL testing data set: nfl_test

dim(nfl_test)
## [1] 80 38

I chose a proportion of 0.7 because it allows an ample amount of training data while still retaining enough observations for testing. Although there are limited observations in total, this proportion is necessary for an accurate model to be constructed. The training data subset has 184 observations and the testing data subset has 80 observations.
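As a quick sanity check (not part of the original workflow), we can confirm that the stratified split kept the proportion of home and visiting winners similar in both subsets; this sketch assumes the nfl_train and nfl_test objects created above:

```r
# compare the class balance of winner across the two subsets;
# stratified sampling should keep these proportions close to
# those of the full data set
library(dplyr)

nfl_train %>% count(winner) %>% mutate(prop = n / sum(n))
nfl_test  %>% count(winner) %>% mutate(prop = n / sum(n))
```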

Building Our Recipe

Now, we will combine our predictor variables and response variable into a machine learning essential called a recipe.

In relation to our data set, a recipe in machine learning is similar to compiling an NFL roster. An NFL roster is comprised of a multitude of players from all walks of life, each bringing a key football component to the team. To create a successful NFL roster, one must bring in players whose football components complement each other's attributes, and a team goes through many trials and errors before it arrives at the best roster its resources allow. In machine learning, instead of football components we use predictor variables, and instead of an NFL roster we compose a recipe that predicts the outcome of a game. Just like in football, we will go through many trials and errors with our recipe, eventually using the resources we have, the training data set observations, to create the most successful recipe possible.

Let's treat our recipe like an NFL team. First we must cut the players on our roster that do not serve our team. In our case we remove the variables team_h and team_v to reduce cardinality. We also remove the variables final_h and final_v because keeping them would be detrimental to model training: they are obvious identifiers of which team won the game. Note that because these variables are removed when the recipe is created, the recipe will ignore them when the testing data is run through it at the end of the process.

# building our recipe
recipe_nfl <- recipe(winner~.,data=nfl_train) %>%
  step_rm(final_h,final_v,team_v,team_h) %>%
  step_dummy(same_div) %>%  # dummy predictor on categorical variables
  step_dummy(same_conf) %>%
  step_dummy(division_h) %>% 
  step_dummy(division_v) %>%
  step_dummy(conference_h) %>%
  step_dummy(conference_v) %>%
  step_dummy(off_bye_wk_h) %>%
  step_dummy(off_bye_wk_v) %>%
  step_center(all_predictors()) %>%   # standardizing our predictors
  step_scale(all_predictors())
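To verify the recipe does what we intend, we can prep() it and bake() the training data; this is a minimal sketch of a check that was not part of the original workflow, assuming recipe_nfl and nfl_train from above:

```r
# prep() estimates the dummy levels and the centering/scaling
# statistics from the training data; bake(new_data = NULL) then
# returns the processed training set so we can confirm that the
# removed columns are gone and the predictors are standardized
recipe_nfl %>%
  prep() %>%
  bake(new_data = NULL) %>%
  glimpse()
```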

K-Fold Cross Validation

Next, we will use stratified sampling for cross-validation on our response variable winner, because a more accurate and precise representation of the sample is produced when the samples are selected in the same proportions as the population. We will use 10 folds to perform stratified cross-validation.

nfl_fold <- vfold_cv(nfl_train, v=10, strata = "winner")
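Each fold holds out roughly a tenth of the 184 training observations as an assessment set; a quick sketch (not in the original analysis) to inspect the folds:

```r
# number of held-out (assessment) rows in each of the 10 folds;
# with 184 training observations each should be around 18
library(purrr)

map_int(nfl_fold$splits, ~ nrow(assessment(.x)))
```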

Prediction Models

In this portion of the project we will lay out the roadmap for model construction, list the models we built using that roadmap, explain how each model was efficiently created, describe the performance metrics used to rank the models, give details about each model, present the best version of each, and show tuning plots and graphs for specific models.

Model Building Roadmap

In the next section we will state each of the 12 models that will be built, fit, and analyzed; however, we must first explain how these models were constructed.

As a quick side note: although our training data set only consists of 184 observations, we have 33 predictor variables and are fitting 12 different models, so each individual model was created, fit, and tuned in a separate R file in order to maximize efficiency and organization. Each of these R files used save() to store its model in an RDA file, and we use load() below to bring those models into this file. All of the separate R files can be found in the data folder attached to this project.

To analyze each of our models, we must first load the results of the separate R files in which the models were created and trained. We load them in below:

load("nfl_log_reg.rda")
load("1nfl_knn.rda")
load("nfl_lda.rda")
load("nfl_qda.rda")
load("1nfl_en.rda")
load("nfl_rf_5.rda")
load("nfl_rf_6.rda")
load("nfl_bt_5.rda")
load("nfl_bt_6.rda")
load("1nfl_dt.rda")
load("nfl_svm_lin.rda")
load("nfl_svm_rbf.rda")

The actual construction of the models is fairly simple. Below is a summarized roadmap of the process for each model:

  1. The data set is loaded into the model's R file using load("nfl_bet.rda").
  2. The seed is set with set.seed(2101) for reproducibility.
  3. The data is split into a training set nfl_train and testing set nfl_test with prop = 0.7 and strata = winner.
  4. The recipe recipe_nfl is created exactly as in the Building Our Recipe section of this project.
  5. Using vfold_cv, the training set is split into 10 folds for stratified cross-validation.
  6. The model-specific engine is added, with set_mode("classification") for every model.
  7. A workflow is created from the model-specific engine and the recipe recipe_nfl. (For Logistic Regression, LDA, and QDA we skip steps 8 - 10 and proceed from step 11.)
  8. A tune grid with model-specific parameters and levels is created.
  9. The model is tuned and its best tune is found using select_best().
  10. Using the best tune, the workflow is finalized.
  11. The model is fit with its workflow to the training data set.
  12. The fitted model (and tune, if applicable) is saved to an RDA file using save() so that the models and tunes can be accessed efficiently.
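The roadmap above can be sketched in code for one tuned model. Here is a hedged outline using the KNN model as an example; the grid range and the object names other than nfl_knn_fit are illustrative, not taken from the project's R files:

```r
# step 6: model spec -- kknn engine, classification mode, neighbors tuned
knn_spec <- nearest_neighbor(neighbors = tune()) %>%
  set_engine("kknn") %>%
  set_mode("classification")

# step 7: workflow combining the spec with recipe_nfl
knn_wf <- workflow() %>%
  add_model(knn_spec) %>%
  add_recipe(recipe_nfl)

# steps 8-9: tune over a grid on the 10 folds and keep the best tune
knn_grid <- grid_regular(neighbors(range = c(1, 15)), levels = 8)
knn_tune <- tune_grid(knn_wf, resamples = nfl_fold, grid = knn_grid)
best_knn <- select_best(knn_tune, metric = "roc_auc")

# steps 10-12: finalize, fit on the full training set, and save
nfl_knn_fit <- finalize_workflow(knn_wf, best_knn) %>%
  fit(nfl_train)
save(nfl_knn_fit, knn_tune, file = "1nfl_knn.rda")
```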

Performance Metrics

In terms of ranking and seeing how each of these models stacks up against the others, we will run performance evaluations using accuracy and roc_auc on the predictions each model produces, each model having been built with the recipe we created earlier in combination with the training data.

In terms of finding the overall best model for predicting the outcome of NFL games, we will find the model with the highest accuracy value and the model with the highest roc_auc value. Using both of these gives us a higher chance of finding a model that performs well on both the testing data set and other new, unseen data.

Accuracy

While not the single best metric for choosing which model will perform well on testing or new data, analyzing each model's accuracy value is informative because of its straightforward interpretation. Although our data is not perfectly balanced (for example, the number of wins by home teams versus visiting teams), pairing accuracy with our next performance metric will give us a clearer picture of each model's capability.

In terms of computation, the accuracy values for classification models are computed as a proportion of the correctly predicted instances out of the total instances.

The formula is: \[ \text{Accuracy} = \frac{\text{Number of Correct Predictions}}{\text{Total Number of Predictions}} \]
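As a toy illustration with made-up labels (not project data), accuracy reduces to the mean of a logical comparison:

```r
# hypothetical true winners and predicted winners for 8 games
truth <- c("H", "H", "V", "H", "V", "V", "H", "H")
pred  <- c("H", "V", "V", "H", "H", "V", "H", "H")

# accuracy = number of correct predictions / total predictions
acc <- mean(truth == pred)
acc  # 6 of 8 games predicted correctly: 0.75
```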

ROC AUC

The second performance metric we will examine is the Receiver Operating Characteristic Area Under the Curve value. More commonly known by its acronym, ROC AUC, this performance metric is much more useful for imbalanced data sets. By evaluating each model's ability to distinguish between classes, the roc_auc value provides a single measure of the quality of each model's predictions, irrespective of any classification threshold.

In terms of computation, the roc_auc values for classification models are computed by calculating the area under the curve formed by plotting the True Positive Rate (TPR) against the False Positive Rate (FPR) at various classification thresholds.

On a simplified level the formulas for TPR and FPR are: \[ \text{TPR} = \frac{\text{TP}}{\text{TP+FN}} \] and \[ \text{FPR} = 1-\text{Specificity}=1-\frac{\text{TN}}{\text{TN+FP}}=\frac{\text{FP}}{\text{TN+FP}} \] where \[ \text{Specificity}=\frac{\text{TN}}{\text{TN+FP}}\]

Note that: \[ \text{TN}\] represents the number of True Negatives, \[ \text{TP}\] the number of True Positives, \[ \text{FP}\] the number of False Positives, and \[ \text{FN}\] the number of False Negatives.
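A toy computation with hypothetical counts (treating a home win as the positive class) makes the relationship between these quantities concrete:

```r
# hypothetical confusion-matrix counts
TP <- 40  # actual home wins predicted as home wins
FN <- 10  # actual home wins predicted as visiting wins
TN <- 25  # actual visiting wins predicted as visiting wins
FP <- 15  # actual visiting wins predicted as home wins

TPR         <- TP / (TP + FN)   # 40 / 50 = 0.8
specificity <- TN / (TN + FP)   # 25 / 40 = 0.625
FPR         <- 1 - specificity  # 0.375
```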

Models Built

Because our project takes the classification approach, there are 12 different models that we will fit our data to. These models are:

  • Logistic Regression
  • K-Nearest Neighbors
  • Linear Discriminant Analysis
  • Quadratic Discriminant Analysis
  • Decision Trees
  • Random Forests at 5 Levels
  • Random Forests at 6 Levels
  • Boosted Trees at 5 Levels
  • Boosted Trees at 6 Levels
  • Elastic Net - Lasso
  • Support Vector Machines - Linear Kernel
  • Support Vector Machines - Radial Kernel

Before we produce these models, however, let’s run through the characteristics of each model used in relation to this project.

Once again we note that for each model the mode was set to "classification".

Logistic Regression

The Logistic Regression model assumes a linear relationship between the log-odds of the dependent variable and the predictors. It models the probability that the home team wins based on the predictors.

The engine used for this model is: "glm"
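Because Logistic Regression has no tuning parameters, its specification and fit are short. Below is a hedged sketch of how nfllog_fit (the fitted object used later in the results) was likely constructed, assuming the recipe and training set from earlier sections:

```r
# logistic regression spec: glm engine, no parameters to tune
log_spec <- logistic_reg() %>%
  set_engine("glm") %>%
  set_mode("classification")

# fit the workflow (spec + recipe) directly to the training data
nfllog_fit <- workflow() %>%
  add_model(log_spec) %>%
  add_recipe(recipe_nfl) %>%
  fit(nfl_train)
```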

K-Nearest Neighbors

The K-Nearest Neighbors (KNN) model classifies the game at hand based on the outcomes of the k most similar previous games (its nearest neighbors) in predictor space.

The engine used for this model is: "kknn".

Linear Discriminant Analysis

The Linear Discriminant Analysis (LDA) model tries to find a linear combination of predictors that best separates whether the home or visiting team won while assuming the home and visiting team share the same covariance matrix.

The engine used for this model is: "MASS".

Quadratic Discriminant Analysis

The Quadratic Discriminant Analysis (QDA) model does the same as LDA but it allows the home and visiting team to each have their own respective covariance matrices.

The engine used for this model is: "MASS".

Decision Trees

The Decision Trees model splits the data into branches based on the values of the predictors to create a tree structure of relationships and interactions between predictor variables.

The engine used for this model is: "rpart".

autoplot() of Decision Trees model tune:

autoplot(tune_tree)

best_dt <- select_best(tune_tree, metric = "roc_auc")
best_dt
## # A tibble: 1 × 2
##   cost_complexity .config              
##             <dbl> <chr>                
## 1          0.0001 Preprocessor1_Model01

The best Decision Tree model was Model 1 with cost_complexity = 0.0001.

Below is an rpart.plot of Model 1

library(rpart.plot)
## Warning: package 'rpart.plot' was built under R version 4.3.2
## Loading required package: rpart
## 
## Attaching package: 'rpart'
## The following object is masked from 'package:dials':
## 
##     prune
tree_final_fit %>%
  extract_fit_engine() %>%
  rpart.plot(roundint=FALSE)

Random Forests

The Random Forests model builds numerous individual decision trees and merges their predictions, which makes it both more accurate and less prone to overfitting than a single decision tree.

For this project, due to the number of predictor variables (33), the random forests model was not able to tune at the full 33 levels. Therefore, we chose to use the square root of 33; since this is not a natural number (√33 ≈ 5.745), levels of both 5 and 6 were used to cover all bases.

The engine used for this model at both levels is: "ranger" with importance = "impurity".
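A sketch of the random forest specification and a 5-level regular grid; the tuning ranges shown are illustrative assumptions, not the exact values from the project's R file:

```r
# random forest spec: mtry, trees, and min_n all tuned;
# ranger engine with impurity-based variable importance
rf_spec <- rand_forest(mtry = tune(), trees = tune(), min_n = tune()) %>%
  set_engine("ranger", importance = "impurity") %>%
  set_mode("classification")

# a regular grid with 5 levels per parameter (the 6-level model
# is identical with levels = 6)
rf_grid_5 <- grid_regular(
  mtry(range = c(1, 6)),
  trees(range = c(100, 400)),
  min_n(range = c(2, 20)),
  levels = 5
)
```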

Random Forests with 5 Levels

autoplot() of Random Forests model tune with 5 Levels

autoplot(tune_rf_5) + theme_minimal()

best_rf_5 <- select_best(tune_rf_5, metric = "roc_auc")
best_rf_5
## # A tibble: 1 × 4
##    mtry trees min_n .config               
##   <int> <int> <int> <chr>                 
## 1     1   175    12 Preprocessor1_Model031

The best random forest model with 5 levels was Model 31 with mtry = 1, trees = 175, and min_n = 12.

Random Forests with 6 Levels

autoplot() of Random Forests model tune with 6 Levels

autoplot(tune_rf_6) + theme_minimal()

best_rf_6 <- select_best(tune_rf_6, metric = "roc_auc")
best_rf_6
## # A tibble: 1 × 4
##    mtry trees min_n .config               
##   <int> <int> <int> <chr>                 
## 1     1   100    10 Preprocessor1_Model001

The best random forest model with 6 levels was Model 1 with mtry = 1, min_n = 10, and trees = 100.

Boosted Trees

The Boosted Trees model combines numerous weak predictors into one strong predictor by correcting the mistakes made in previous trees.

Just like the random forests models, due to the number of predictor variables (33), the boosted trees model was not able to tune at the full 33 levels. Therefore, we chose to use the square root of 33; since this is not a natural number (√33 ≈ 5.745), levels of both 5 and 6 were used to cover all bases.

The engine used for this model at both levels is: "xgboost".
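Similarly, a sketch of the boosted trees specification. Note that dials tunes learn_rate on a log10 scale, which is consistent with selected values such as 3.16e-06 (= 10^-5.5); the ranges below are nonetheless illustrative assumptions:

```r
# boosted trees spec: mtry, trees, and learn_rate tuned; xgboost engine
bt_spec <- boost_tree(mtry = tune(), trees = tune(), learn_rate = tune()) %>%
  set_engine("xgboost") %>%
  set_mode("classification")

# learn_rate's range is given in log10 units, so c(-10, -1)
# spans 1e-10 through 0.1
bt_grid_5 <- grid_regular(
  mtry(range = c(1, 6)),
  trees(range = c(100, 400)),
  learn_rate(range = c(-10, -1)),
  levels = 5
)
```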

Boosted Trees with 5 Levels

autoplot() of Boosted Trees model tune with 5 levels

autoplot(tune_bt_nfl_5) + theme_minimal()

best_bt_5 <- select_best(tune_bt_nfl_5, metric = "roc_auc")
best_bt_5
## # A tibble: 1 × 4
##    mtry trees learn_rate .config               
##   <int> <int>      <dbl> <chr>                 
## 1     1   100 0.00000316 Preprocessor1_Model011

The best boosted trees model with 5 levels was Model 11 with mtry = 1, trees = 100, and learn_rate = 3.162278e-06.

Boosted Trees with 6 Levels

autoplot() of Boosted Trees model tune with 6 levels

autoplot(tune_bt_nfl_6) + theme_minimal()

best_bt_6 <- select_best(tune_bt_nfl_6, metric = "roc_auc")
best_bt_6
## # A tibble: 1 × 4
##    mtry trees learn_rate .config               
##   <int> <int>      <dbl> <chr>                 
## 1     1   100    0.00158 Preprocessor1_Model025

The best boosted trees model with 6 levels was Model 25 with mtry = 1, trees = 100, and learn_rate = 0.001584893.

Elastic Net - Lasso

The Elastic Net - Lasso model is mainly useful when multiple predictors are correlated; it uses the Lasso regularization technique to improve the model's predictive power.

The engine used for this model is: "glmnet".
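A sketch of the elastic net specification, where both the penalty (amount of regularization) and mixture (0 = pure ridge, 1 = pure lasso) are tuned. The grid below is an illustrative assumption; the project's actual grid evidently included values of 1/3 for both parameters:

```r
# elastic net spec: glmnet engine with penalty and mixture tuned
en_spec <- logistic_reg(penalty = tune(), mixture = tune()) %>%
  set_engine("glmnet") %>%
  set_mode("classification")

# penalty is tuned on a log10 scale by default; mixture linearly on [0, 1]
en_grid <- grid_regular(
  penalty(range = c(-5, 1)),
  mixture(range = c(0, 1)),
  levels = 10
)
```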

autoplot() of Elastic Net - Lasso model tune

autoplot(tune_res_nfl)
## Warning: Transformation introduced infinite values in continuous x-axis
## Transformation introduced infinite values in continuous x-axis

best_en_nfl <- select_best(tune_res_nfl, metric = "roc_auc")
best_en_nfl
## # A tibble: 1 × 3
##   penalty mixture .config               
##     <dbl>   <dbl> <chr>                 
## 1   0.333   0.333 Preprocessor1_Model034

The best Elastic Net - Lasso model was Model 34 with penalty = 0.3333333 and mixture = 0.3333333.

Support Vector Machines - Linear Kernel

The Support Vector Machines (SVM) - Linear Kernel model is useful for linearly separable data, that is, data whose classes can be separated by a single line (or hyperplane), especially when there is a multitude of predictors.

The engine used for this model is: "kernlab".

autoplot() of SVM - Linear Kernel tune

autoplot(svm_nfl_res)

best_svm <- select_best(svm_nfl_res, metric = "roc_auc")
best_svm
## # A tibble: 1 × 2
##      cost .config              
##     <dbl> <chr>                
## 1 0.00521 Preprocessor1_Model01

The best SVM - Linear Kernel model was Model 1 with cost = 0.005212106.

Support Vector Machines - Radial Kernel

The Support Vector Machines (SVM) - Radial Kernel model expands on the linear-kernel SVM by allowing the model to produce complex, non-linear decision boundaries, which is useful when the relationships between predictors are non-linear and complex.

The engine used for this model is: "kernlab".

autoplot() of SVM - Radial Kernel tune

autoplot(svm_rbf_res)

best_rbf_svm <- select_best(svm_rbf_res, metric = "roc_auc")
best_rbf_svm
## # A tibble: 1 × 2
##    cost .config              
##   <dbl> <chr>                
## 1 0.108 Preprocessor1_Model03

The best SVM - Radial Kernel model was Model 3 with cost = 0.1077689.

Model Results

We can now analyze the performance metrics of each model and determine our best and worst models.

So close to the finish line!

Accuracy Values

Next, before we look further into any of the models used, let’s rank the accuracy values of all of our models.

# multi metric needed to find accuracy of each model's predictions
multi_metric <- metric_set(accuracy, sensitivity, specificity)

# accuracy of logistic regression model's predictions
log_acc <- augment(nfllog_fit, new_data = nfl_train) %>%
  accuracy(truth = winner, estimate = .pred_class)

# accuracy of LDA model's predictions
lda_acc <- augment(nfl_lda_fit, new_data = nfl_train) %>%
  accuracy(truth = winner, estimate = .pred_class)

# accuracy of QDA model's predictions
qda_acc <- augment(nfl_qda_fit, new_data = nfl_train) %>%
  accuracy(truth = winner, estimate = .pred_class)

# accuracy of knn model's predictions
knn_acc <- augment(nfl_knn_fit, new_data = nfl_train) %>%
  accuracy(truth = winner, estimate = .pred_class)

# accuracy of decision trees model's predictions
dt_acc <- augment(tree_final_fit, new_data = nfl_train) %>%
  accuracy(truth = winner, estimate = .pred_class)

# accuracy of elastic net model's predictions
en_acc <- augment(en_final_nfl, new_data = nfl_train) %>%
  accuracy(truth = winner, estimate = .pred_class)

# accuracy of random forests at 5 levels model's predictions
rf_5_acc <- augment(final_rf_fit_5, new_data = nfl_train) %>%
  accuracy(truth = winner, estimate = .pred_class)

# accuracy of random forests at 6 levels model's predictions
rf_6_acc <- augment(final_rf_fit_6, new_data = nfl_train) %>%
  accuracy(truth = winner, estimate = .pred_class)

# accuracy of boosted trees at 5 levels model's predictions
bt_5_acc <- augment(final_bt_fit_5, new_data = nfl_train) %>%
  accuracy(truth = winner, estimate = .pred_class)

# accuracy of boosted trees at 6 levels model's predictions
bt_6_acc <- augment(final_bt_fit_6, new_data = nfl_train) %>%
  accuracy(truth = winner, estimate = .pred_class)

# accuracy of linear SVM model's predictions
lin_svm_acc <- augment(svm_final_fit, new_data = nfl_train) %>%
  accuracy(truth = winner, estimate = .pred_class)

# accuracy of radial SVM model's predictions
r_svm_acc <- augment(svm_rbf_final_fit, new_data = nfl_train) %>%
  accuracy(truth = winner, estimate = .pred_class)



accuracies_acc <- c(log_acc$.estimate, 
                lda_acc$.estimate, 
                qda_acc$.estimate,
                knn_acc$.estimate,
                dt_acc$.estimate,
                en_acc$.estimate,
                rf_5_acc$.estimate,
                rf_6_acc$.estimate,
                bt_5_acc$.estimate,
                bt_6_acc$.estimate,
                lin_svm_acc$.estimate,
                r_svm_acc$.estimate
                )
models_acc <- c("Logistic Regression", 
            "LDA", 
            "QDA", 
            "KNN", 
            "Decision Trees",
            "Elastic Net: Lasso",
            "Random Forests: 5 Levels",
            "Random Forests: 6 Levels",
            "Boosted Trees: 5 Levels",
            "Boosted Trees: 6 Levels",
            "Linear SVM",
            "Radial SVM"
            )
results_acc <- tibble(accuracies_acc = accuracies_acc, models_acc = models_acc)
results_acc %>%
  arrange(-accuracies_acc)
## # A tibble: 12 × 2
##    accuracies_acc models_acc              
##             <dbl> <chr>                   
##  1          0.967 QDA                     
##  2          0.924 Random Forests: 6 Levels
##  3          0.897 Random Forests: 5 Levels
##  4          0.848 KNN                     
##  5          0.793 Decision Trees          
##  6          0.761 Boosted Trees: 6 Levels 
##  7          0.75  Boosted Trees: 5 Levels 
##  8          0.745 LDA                     
##  9          0.734 Logistic Regression     
## 10          0.609 Linear SVM              
## 11          0.598 Elastic Net: Lasso      
## 12          0.598 Radial SVM

The accuracy values of each model’s predictions give us insight into how well our models performed. Although these values are less important, for classification purposes, than each model’s roc_auc value, they are informative for a surface-level interpretation of each model’s performance.
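As a side note on design, the twelve near-identical augment() blocks above could be collapsed into a single loop. A hedged refactoring sketch, where the named list is assembled manually from the fitted model objects:

```r
# map each fitted workflow to its training-set accuracy in one pass
fits <- list(
  "Logistic Regression" = nfllog_fit,
  "LDA"                 = nfl_lda_fit,
  "QDA"                 = nfl_qda_fit
  # ... and so on for the remaining nine fitted models
)

results_acc <- purrr::imap_dfr(fits, function(fit, name) {
  augment(fit, new_data = nfl_train) %>%
    accuracy(truth = winner, estimate = .pred_class) %>%
    mutate(models_acc = name)
})
```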

Let’s also get a visual of how each model’s accuracy compares to the others, in the form of a bar plot.

acc_bar <- ggplot(results_acc, aes(x = factor(models_acc, levels = models_acc[order(accuracies_acc, decreasing = TRUE)]), y = accuracies_acc, fill = models_acc)) +
  geom_bar(stat = "identity") +
  theme_minimal() +
  labs(x = "Models", y = "Accuracy") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))  # Rotate x labels to avoid overlap
acc_bar

This visual is a little crowded, so we will use a dot plot to get a better view of the accuracy values.

acc_dot <- ggplot(results_acc, aes(x = factor(models_acc, levels = models_acc[order(accuracies_acc, decreasing = TRUE)]), y = accuracies_acc)) +
  geom_point(size =10, color = "cyan") +
  geom_segment(aes(x = models_acc, 
                   xend = models_acc, 
                   y=min(accuracies_acc), 
                   yend = max(accuracies_acc)), 
               linetype = "dashed", 
               linewidth=0.5) + 
  labs(title = "Performance of Our Models", x = "Models",y="Accuracy Value") + 
  theme_minimal() +
  coord_flip()
acc_dot

Thus, we can see both visually and numerically that:

  • Our model with the lowest accuracy value was a tie between: Elastic Net - Lasso & Radial SVM

  • Our model with the median accuracy value was: Boosted Trees with 5 Levels

  • Our model with the best accuracy value was: Quadratic Discriminant Analysis

ROC_AUC Values

While the accuracy values of the predicted outcomes are an interesting metric to look at, we move on to ranking our roc_auc values, which matter much more when choosing the best model free of threshold interference. Below is every model’s roc_auc value, ranked from best to worst performance.

# roc_auc of logistic regression model's predictions
log_roc <- augment(nfllog_fit, new_data = nfl_train) %>%
  roc_auc(winner, .pred_H)

# roc_auc of LDA model's predictions
lda_roc <- augment(nfl_lda_fit, new_data = nfl_train) %>%
  roc_auc(winner, .pred_H)

# roc_auc of QDA model's predictions
qda_roc <- augment(nfl_qda_fit, new_data = nfl_train) %>%
  roc_auc(winner, .pred_H)

# roc_auc of knn model's predictions
knn_roc <- augment(nfl_knn_fit, new_data = nfl_train) %>%
  roc_auc(winner, .pred_H)

# roc_auc of decision trees model's predictions
dt_roc <- augment(tree_final_fit, new_data = nfl_train) %>%
  roc_auc(winner, .pred_H)

# roc_auc of elastic net model's predictions
en_roc <- augment(en_final_nfl, new_data = nfl_train) %>%
  roc_auc(winner, .pred_H)

# roc_auc of random forests at 5 levels model's predictions
rf_5_roc <- augment(final_rf_fit_5, new_data = nfl_train) %>%
  roc_auc(winner, .pred_H)

# roc_auc of random forests at 6 levels model's predictions
rf_6_roc <- augment(final_rf_fit_6, new_data = nfl_train) %>%
  roc_auc(winner, .pred_H)

# roc_auc of boosted trees at 5 levels model's predictions
bt_5_roc <- augment(final_bt_fit_5, new_data = nfl_train) %>%
  roc_auc(winner, .pred_H)

# roc_auc of boosted trees at 6 levels model's predictions
bt_6_roc <- augment(final_bt_fit_6, new_data = nfl_train) %>%
  roc_auc(winner, .pred_H)

# roc_auc of linear SVM model's predictions
lin_svm_roc <- augment(svm_final_fit, new_data = nfl_train) %>%
  roc_auc(winner, .pred_H)

# roc_auc of radial SVM model's predictions
r_svm_roc <- augment(svm_rbf_final_fit, new_data = nfl_train) %>%
  roc_auc(winner, .pred_H)



roc_auc <- c(log_roc$.estimate, 
                lda_roc$.estimate, 
                qda_roc$.estimate,
                knn_roc$.estimate,
                dt_roc$.estimate,
                en_roc$.estimate,
                rf_5_roc$.estimate,
                rf_6_roc$.estimate,
                bt_5_roc$.estimate,
                bt_6_roc$.estimate,
                lin_svm_roc$.estimate,
                r_svm_roc$.estimate
                )
models <- c("Logistic Regression", 
            "LDA", 
            "QDA", 
            "KNN", 
            "Decision Trees",
            "Elastic Net: Lasso",
            "Random Forests: 5 Levels",
            "Random Forests: 6 Levels",
            "Boosted Trees: 5 Levels",
            "Boosted Trees: 6 Levels",
            "Linear SVM",
            "Radial SVM"
            )
results_roc <- tibble(roc_auc = roc_auc, models = models)
results_roc %>%
  arrange(-roc_auc)
## # A tibble: 12 × 2
##    roc_auc models                  
##      <dbl> <chr>                   
##  1   0.997 Random Forests: 6 Levels
##  2   0.996 QDA                     
##  3   0.995 Random Forests: 5 Levels
##  4   0.954 Boosted Trees: 6 Levels 
##  5   0.951 KNN                     
##  6   0.948 Boosted Trees: 5 Levels 
##  7   0.889 Radial SVM              
##  8   0.874 Decision Trees          
##  9   0.804 Logistic Regression     
## 10   0.802 LDA                     
## 11   0.759 Linear SVM              
## 12   0.665 Elastic Net: Lasso

Let’s also get a visual of how each model’s roc_auc compares to the others, in the form of a bar plot.

roc_bar <- ggplot(results_roc, aes(x = factor(models, levels = models[order(roc_auc, decreasing = TRUE)]), y = roc_auc, fill = models)) +
  geom_bar(stat = "identity") +
  theme_minimal() +
  labs(x = "Models", y = "ROC AUC") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))  # Rotate x labels to avoid overlap
roc_bar

This visual is a little crowded, so we will use a dot plot to get a better view of the roc_auc values.

roc_dot <- ggplot(results_roc, aes(x = factor(models, levels = models[order(roc_auc, decreasing = TRUE)]), y = roc_auc)) +
  geom_point(size =10, color = "cyan") +
  geom_segment(aes(x = models, 
                   xend = models, 
                   y=min(roc_auc), 
                   yend = max(roc_auc)), 
               linetype = "dashed", 
               linewidth=0.5) + 
  labs(title = "Performance of Our Models", x = "Models",y="ROC_AUC Value") + 
  theme_minimal() +
  coord_flip()
roc_dot

Thus, we can see both visually and numerically that:

  • Our model with the lowest roc_auc value was: Elastic Net - Lasso

  • Our model with the median roc_auc value was: Boosted Trees with 5 Levels

  • Our model with the best roc_auc value was: Random Forests Model with 6 Levels

Worst of the Worst

Let’s take a look at our model that performed the worst, in terms of roc_auc & accuracy value, and analyze graphically why the model performed so poorly.

We will produce the metrics, compute the best model found, print the confusion matrix of the number of correct and incorrect predicted values, and analyze the ROC curve graph of the Elastic Net - Lasso model.

From these, we will be able to comprehend why the Elastic Net - Lasso model attained the lowest roc_auc and accuracy values of all models using the training data.

Our Worst Model: Elastic Net - Lasso

Let’s begin with the worst performing model, which had a roc_auc value of 0.6647420 and an accuracy value of 0.5978261.

Metrics and Best Model

First, we will analyze the metrics of the Elastic Net - Lasso model.

multi_metric <- metric_set(accuracy, sensitivity, specificity)

augment(en_final_nfl, new_data = nfl_train) %>%
  multi_metric(truth = winner, estimate = .pred_class)
## # A tibble: 3 × 3
##   .metric     .estimator .estimate
##   <chr>       <chr>          <dbl>
## 1 accuracy    binary         0.598
## 2 sensitivity binary         1    
## 3 specificity binary         0

We see that the accuracy of this model’s predicted values was 59.78261%, with a sensitivity of 1 and a specificity of 0, meaning the model predicted a home-team win in essentially every game.

However, in comparison to the rest of the models the Elastic Net - Lasso model was dead last.

In terms of gap, this accuracy value was 0.3695652 lower than the model with the best accuracy value and 0.1521739 lower than the model with the median accuracy value.

Next, let’s see which Elastic Net - Lasso model was chosen as best.

best_en_nfl <- select_best(tune_res_nfl, metric = "roc_auc")
best_en_nfl
## # A tibble: 1 × 3
##   penalty mixture .config               
##     <dbl>   <dbl> <chr>                 
## 1   0.333   0.333 Preprocessor1_Model034

According to our roc_auc criterion, the model with penalty = 0.3333333, mixture = 0.3333333, and .config = Preprocessor1_Model034 performs the best for this model. This is backed up by our autoplot() function in a previous section.

Confusion Matrix

Now, we produce the confusion matrix of the predicted values for the training data.

suppressMessages({
  augment(en_final_nfl, new_data = nfl_train) %>%
  conf_mat(truth = winner, estimate = .pred_class) %>%
  autoplot(type = "heatmap") + scale_fill_gradient(low = "red", high = "green")
})

We see that this model correctly predicted 110 games, every one of them a game in which the home team won; it did not correctly predict a single visiting-team win.

We also see that this model incorrectly predicted 74 games, in every case predicting a home win when the visiting team actually won.

Thus, the poor performance of this model is most likely attributable to its complete inability to predict visiting-team wins.
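To see why this floor exists, note that a degenerate classifier that always picks the home team attains essentially this accuracy. The following is a sketch using counts inferred from the metrics above; the exact split of 110 home wins to 74 visitor wins in the training data is my inference from the reported sensitivity, specificity, and accuracy values, not a number printed in the output.

```r
# "Always pick the home team" baseline; counts inferred from the metrics above
home_wins    <- 110  # inferred: sensitivity = 1, accuracy ~= 0.598, n = 184
visitor_wins <- 74
baseline_acc <- home_wins / (home_wins + visitor_wins)
round(baseline_acc, 4)  # ~0.5978, matching the lasso model's accuracy
```

In other words, the lasso model's accuracy is no better than the home-win base rate of the training data.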

ROC Curve Graph

Last, take a look at the ROC curve produced from our Elastic Net - Lasso model using the training data.

augment(en_final_nfl, new_data = nfl_train) %>%
  roc_curve(winner, .pred_H) %>%
  autoplot()

We conclude that the low roc_auc is confirmed by the poor ROC curve of this model. The characteristic that makes this ROC curve poor is that it sits closer to the diagonal (the line of no discrimination) than to the top left corner.

Best of the Best

Let’s take a look at our top two models, in terms of roc_auc and accuracy, and analyze graphically why they performed so well.

We will produce the metrics, compute the best model found, print the confusion matrix of the number of correct and incorrect predicted values, and analyze the ROC curve graph of the Quadratic Discriminant Analysis and Random Forests with 6 Levels models.

From these, we will be able to comprehend why the Quadratic Discriminant Analysis and Random Forests with 6 Levels models attained the two highest roc_auc values of all models using the training data.

Our Best Model for Accuracy: Quadratic Discriminant Analysis

Let’s begin with the model with the best accuracy value of 0.9673913 and the second best roc_auc value of 0.9964373.

Metrics

First, we will analyze the metrics of the QDA model.

multi_metric <- metric_set(accuracy, sensitivity, specificity)

augment(nfl_qda_fit, new_data = nfl_train) %>%
  multi_metric(truth = winner, estimate = .pred_class)
## # A tibble: 3 × 3
##   .metric     .estimator .estimate
##   <chr>       <chr>          <dbl>
## 1 accuracy    binary         0.967
## 2 sensitivity binary         0.973
## 3 specificity binary         0.959

We see that this model did very well in terms of metrics with the accuracy of predicted values at 96.73913%.

Confusion Matrix

Now, we produce the confusion matrix of the predicted values for the training data.

suppressMessages({
  augment(nfl_qda_fit, new_data = nfl_train) %>%
  conf_mat(truth = winner, estimate = .pred_class) %>%
  autoplot(type = "heatmap") + scale_fill_gradient(low = "red", high = "green")
})

We see that this model correctly predicted 178 games. With 107 games correctly predicting the home team won. And 71 games correctly predicting the visiting team won.

We also see this model incorrectly predicted 6 games. With 3 games incorrectly predicting the home team won. And 3 games incorrectly predicting the visiting team won.

ROC Curve Graph

Last, take a look at the ROC curve produced from our QDA model using the training data.

augment(nfl_qda_fit, new_data = nfl_train) %>%
  roc_curve(winner, .pred_H) %>%
  autoplot()

We conclude that the high roc_auc is confirmed by the great ROC curve of this model. The characteristic that makes this ROC curve great is the fact that the ROC curve is very near the top left corner of the graph itself.

Our Best Model for ROC AUC: Random Forests Model with 6 Levels

Let’s finish with the model with the best roc_auc value of 0.9974201 and the second best accuracy value of 0.9239130.

Metrics and Best Model

First, we will analyze the metrics of the Random Forests Model with 6 Levels model.

multi_metric <- metric_set(accuracy, sensitivity, specificity)

augment(final_rf_fit_6, new_data = nfl_train) %>%
  multi_metric(truth = winner, estimate = .pred_class)
## # A tibble: 3 × 3
##   .metric     .estimator .estimate
##   <chr>       <chr>          <dbl>
## 1 accuracy    binary         0.924
## 2 sensitivity binary         1    
## 3 specificity binary         0.811

We see that this model did very well in terms of metrics with the accuracy of predicted values at 92.39130%.

Next, let’s see which Random Forests Model with 6 Levels was chosen as best.

best_rf_6 <- select_best(
  tune_rf_6, metric = "roc_auc"
)
best_rf_6
## # A tibble: 1 × 4
##    mtry trees min_n .config               
##   <int> <int> <int> <chr>                 
## 1     1   100    10 Preprocessor1_Model001

According to our roc_auc criteria, the model with mtry = 1, trees = 100, and min_n = 10 performs the best for this model. This is backed up by our autoplot() function in a previous section.
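For reference, a parsnip specification carrying these selected hyperparameters would look something like the following. This is a sketch, not the original tuning code; the `ranger` engine and the `importance = "impurity"` setting are assumptions on my part (the latter is what makes the variable importance plot in the conclusion possible).

```r
# Sketch of a parsnip random forest spec with the selected hyperparameters
# (mtry = 1, trees = 100, min_n = 10); engine choice is an assumption
library(tidymodels)

rf_best_spec <- rand_forest(mtry = 1, trees = 100, min_n = 10) %>%
  set_engine("ranger", importance = "impurity") %>%
  set_mode("classification")
```

A spec like this, combined with the recipe in a workflow, is what `select_best()` effectively finalizes before the last fit on the training data.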

Confusion Matrix

Now, we produce the confusion matrix of the predicted values for the training data.

suppressMessages({
  augment(final_rf_fit_6, new_data = nfl_train) %>%
  conf_mat(truth = winner, estimate = .pred_class) %>%
  autoplot(type = "heatmap") + scale_fill_gradient(low = "red", high = "green")
})

We see that this model correctly predicted 170 games. With 110 games correctly predicting the home team won. And 60 games correctly predicting the visiting team won.

We also see this model incorrectly predicted 14 games. With 14 games incorrectly predicting the home team won. And 0 games incorrectly predicting the visiting team won.

ROC Curve Graph

Last, take a look at the ROC curve produced from our Random Forests Model with 6 Levels model using the training data.

augment(final_rf_fit_6, new_data = nfl_train) %>%
  roc_curve(winner, .pred_H) %>%
  autoplot()

We conclude that the high roc_auc is confirmed by the great ROC curve of this model. The characteristic that makes this ROC curve great is the fact that the ROC curve, while not grossly better, is even closer to the top left corner of the graph than the QDA model.

Testing Our Best of the Best

Now that we have determined our two best models, Quadratic Discriminant Analysis (by way of highest accuracy value) and Random Forests Model with 6 Levels (by way of highest roc_auc value), let’s fit both models separately to the test data and review our results.

This is pretty much our machine learning Super Bowl!

Quadratic Discriminant Analysis

QDA Accuracy

First, we check how accurately this model predicted the winner of each NFL game in the testing data set.

qda.acc <- augment(nfl_qda_fit, new_data = nfl_test) %>%
  accuracy(truth = winner, estimate = .pred_class)
qda.acc
## # A tibble: 1 × 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy binary         0.625

QDA ROC Curve

Before we check the roc_auc for this model with the testing data set, let’s look at the ROC curve first.

augment(nfl_qda_fit, new_data = nfl_test) %>%
  roc_curve(winner, .pred_H) %>%
  autoplot()

QDA ROC AUC

Now we check how well the model performed with the testing data in terms of roc_auc value:

qda.roc <- augment(nfl_qda_fit, new_data = nfl_test) %>%
  roc_auc(winner, .pred_H)
qda.roc
## # A tibble: 1 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 roc_auc binary         0.618

Random Forests Model with 6 Levels

Random Forests Model with 6 Levels Accuracy

First, we check how accurately this model predicted the winner of each NFL game in the testing data set.

rf6.acc <- augment(final_rf_fit_6, new_data = nfl_test) %>%
  accuracy(truth = winner, estimate = .pred_class)
rf6.acc
## # A tibble: 1 × 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy binary           0.7

Random Forests Model with 6 Levels ROC Curve

Before we check the roc_auc for this model with the testing data set, let’s look at the ROC curve first.

augment(final_rf_fit_6, new_data = nfl_test) %>%
  roc_curve(winner, .pred_H) %>%
  autoplot()

Random Forests Model with 6 Levels ROC AUC

Now we check how well the model performed with the testing data in terms of roc_auc value:

rf6.roc <- augment(final_rf_fit_6, new_data = nfl_test) %>%
  roc_auc(winner, .pred_H)
rf6.roc
## # A tibble: 1 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 roc_auc binary         0.736

Comparing Our Best Models

Accuracy

Let’s see which model had the better accuracy value using the testing data set.

best_mod <- c("QDA","Random Forests Model with 6 Levels")
acc_fin <- c(qda.acc$.estimate,rf6.acc$.estimate)
best_acc <- tibble(accuracy = acc_fin, models = best_mod)
best_acc %>%
  arrange(-acc_fin)
## # A tibble: 2 × 2
##   accuracy models                            
##      <dbl> <chr>                             
## 1    0.7   Random Forests Model with 6 Levels
## 2    0.625 QDA

Thus, we see that the Random Forests Model with 6 Levels performed better in terms of the accuracy value performance metric.

ROC AUC

Now let’s see how the two match up when it comes to roc_auc values:

best_mod <- c("QDA","Random Forests Model with 6 Levels")
roc_fin <- c(qda.roc$.estimate,rf6.roc$.estimate)
best_roc <- tibble(roc_auc = roc_fin, models = best_mod)
best_roc %>%
  arrange(-roc_fin)
## # A tibble: 2 × 2
##   roc_auc models                            
##     <dbl> <chr>                             
## 1   0.736 Random Forests Model with 6 Levels
## 2   0.618 QDA

Once again the Random Forests Model with 6 Levels performed better.

Final Results and Conclusion

Thus, we have determined that our best model in terms of both accuracy value and roc_auc value is the Random Forests Model with 6 Levels.

Although our best model of the two only had an accuracy value of 70% and a roc_auc value of 0.7356771, this is far and away a victory in my eyes.

Before we conclude, let’s take a peek at the variable importance plot to understand which predictors played the largest part in influencing this model’s predictions.

library(vip)
## Warning: package 'vip' was built under R version 4.3.2
## 
## Attaching package: 'vip'
## The following object is masked from 'package:utils':
## 
##     vi
final_rf_fit_6 %>%
  extract_fit_parsnip() %>%
  vip(aesthetics = list(fill = "cyan", color = "black")) +
  theme_minimal()

I guess Vegas really does have the greatest influence, or at least in our model it does!

When thinking about potential improvements or adjustments to this model, I believe a better boosted trees or random forests model could be built on a machine with more processing power. Because we were not able to tune both of these models across the total number of predictors, and instead had to settle for the square root, we likely missed candidate models with even better roc_auc and accuracy values. I attempted to install a package that would use all 10 cores of my laptop, instead of the single core R uses by default, but that package had not been updated for the newest version of R, so I was unable to utilize it. The last potential improvement I would recommend is more predictors: even though we had a decent number of predictors for this model, I believe we missed some, such as weather and player- or team-specific statistics, that could possibly lead to a model that performs better on testing and unseen data.
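On the parallelization point, one common way to give tidymodels tuning access to multiple cores is sketched below. This assumes the doParallel package installs cleanly on your R version (it is not necessarily the package I originally attempted), and the core count shown is illustrative.

```r
# Hedged sketch: register a parallel backend so that tuning functions such as
# tune_grid() can spread resample fits across cores. Assumes doParallel is
# installed; leaving one core free for the OS is a common convention.
library(doParallel)

cl <- parallel::makePSOCKcluster(parallel::detectCores() - 1)
registerDoParallel(cl)

# ... run tune_grid() / fit_resamples() here; tune picks up the backend ...

stopCluster(cl)
```

With a backend registered, the same tuning code runs unchanged; each resample fit is simply dispatched to a worker.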

In terms of an NFL game, or any sports game for that matter, predicting the outcome is very difficult given the number of variables that simply cannot be accounted for or reduced to categorical and numeric values. In my opinion, even though our accuracy and roc_auc values may seem low for a machine learning project, the fact that we were able to produce values above 60% (0.6) on unseen test data is very impressive.

Throughout this entire project I faced many challenges and found myself stuck for hours and even days trying to fix problems and build models that actually worked. While this may have been frustrating in the moment, I am immensely proud of the results that I was able to produce with this project and will effectively use this Random Forests Model with 6 Levels for my own personal enjoyment (and possibly betting wagers for fun).